
Add guidance for CO HDF/NetCDF #121

Merged: 19 commits merged into staging on Dec 20, 2024

Conversation

@abarciauskas-bgse (Contributor) commented Nov 13, 2024

Adds long overdue and much requested guidance on cloud-optimizing HDF(5) and NetCDF(-4).

I've added @ajelenak and @ashiklom as co-authors, and also cited @bilts, @betolink, and @andypbarrett, so I'm tagging all of you for review.

github-actions bot commented Nov 14, 2024

PR Preview Action v1.4.8
🚀 Deployed preview to https://cloudnativegeo.github.io/cloud-optimized-geospatial-formats-guide/pr-preview/pr-121/
on branch gh-pages at 2024-12-20 21:21 UTC

@abarciauskas-bgse marked this pull request as ready for review November 15, 2024 00:07
@wildintellect (Contributor) commented:

@abarciauskas-bgse this is a great first version; a few questions and suggested fixes:

  • Fix: Compression is currently a subheading under Consolidated Metadata.
  • Q: When talking about optimum chunk size, is this the compressed size? Since compressed chunks are what get delivered, I would think you want to target compressed sizes.
  • Q: In Additional Research, Chuck's example was on a non-cloud-optimized HDF5; that's probably important to note.
  • Fix: "How to check chunk size and shape" is missing output and an explanation of how to read the output (see the sketch after this comment).
  • Q: Do we want to reference Zarr/kerchunking in some way as alternatives: Zarr for when cloud native is fine and you don't need a single "archival file", kerchunking (e.g. Kerchunk) for when you want an index around an existing file you don't want to or can't change?

TODO: We'll open a different ticket for a notebook page about writing files from Python etc., rather than always having to repack existing files.
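
For the "how to check chunk size and shape" item, this is roughly the kind of check and output I have in mind (a minimal sketch with h5py, assuming h5py 3.x; the file and dataset paths are made up). It also touches the compressed vs. uncompressed chunk size question:

```python
import h5py

# Sketch: inspect chunking for one dataset; file and dataset paths are hypothetical.
with h5py.File("granule.h5", "r") as f:
    dset = f["group/variable"]
    print("array shape:       ", dset.shape)
    print("chunk shape:       ", dset.chunks)      # None means contiguous (unchunked) storage
    print("compression filter:", dset.compression)
    if dset.chunks is not None:
        # Uncompressed chunk size = product of the chunk dimensions times the item size.
        nbytes = dset.dtype.itemsize
        for dim in dset.chunks:
            nbytes *= dim
        print("uncompressed chunk size (bytes):", nbytes)
        # Stored (compressed) size of the first written chunk on disk (h5py 3.x).
        print("stored size of first chunk (bytes):", dset.id.get_chunk_info(0).size)
```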

@wildintellect (Contributor) left a comment:

A couple of fixes #121 (comment)

@betolink commented:

This looks good! Just some minor additions that I'm not sure are relevant for a first pass on this.

  • As of now, the page size applies across the whole file. This means a user has to find a balanced page size that keeps metadata requests to a minimum while the unused space in the data pages does not increase the file size by a lot. We noticed this for IS2 ATL03; for example: file size 6GB, total metadata ~20MB. With 8MB pages the file size increases by roughly 1%, but for smaller files (e.g. <1GB) the 8MB page size increased the file size by ~10%, and this percentage varies depending on the ratio of page size to data chunk size. In short, a user should be careful when picking a page size, as it's dataset dependent.
  • The official HDF5 library needs to be configured to make use of page-aggregated files; I think @ajelenak said this will change in March when the HDF Group releases the next major version.
  • HDF5 doesn't have a geo spec for spatial chunking. At the lowest level, if a user needs to subset data (e.g. lat/lon subsetting), the HDF5 library has to load all the chunks of a dataset to create an index and use it to subset. To take CO-HDF5 to the next level, each chunk in the file should carry polygon/bbox info, indexed in a way that the drivers can understand. This is related to over-reads: e.g. our data chunks are ~1MB each and we use 8MB pages, so in a subsetting operation where we only need 2 chunks from contiguous pages... we will be requesting 16MB instead of 2MB. @ajelenak can confirm if this is true, and @bilts also mentioned it in his ESIP talk.
  • On creating vs repacking:

If I think of more stuff I'll add it later (I will be out next week). We also need to finish our tech report on IS2 and CO-HDF5; I think it should be ready for AGU.

@abarciauskas-bgse (Contributor, Author) commented:

@betolink @wildintellect thanks for the feedback. I have some AGU prep to do but once that is done I will address the comments.

@abarciauskas-bgse (Contributor, Author) commented Dec 16, 2024

@betolink thank you so much for these detailed comments. I have some comments and questions that I hope will help me sort out the details...

> As of now, the page size applies across the whole file. This means a user has to find a balanced page size that keeps metadata requests to a minimum while the unused space in the data pages does not increase the file size by a lot. We noticed this for IS2 ATL03; for example: file size 6GB, total metadata ~20MB. With 8MB pages the file size increases by roughly 1%, but for smaller files (e.g. <1GB) the 8MB page size increased the file size by ~10%, and this percentage varies depending on the ratio of page size to data chunk size. In short, a user should be careful when picking a page size, as it's dataset dependent.

I'm reading a bit more documentation and now I am confused, so I'm hoping to clarify. Using h5repack, it appears there is just one FS_PAGESIZE argument that can be set, to be used in combination with FS_STRATEGY=PAGE. But then I found in the HDF5 library documentation there are both H5Pset_small_data_block_size and H5Pset_meta_block_size, which leads me to believe you can set the metadata block size separately from the raw data block size (and more specifically, it's a block size for "small" data, so I'm assuming that just means cases where multiple small raw datasets can fit into one block). Do you know whether h5repack uses FS_PAGESIZE for both metadata and small raw data?
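
For concreteness, this is the kind of file creation I'm asking about, sketched with h5py rather than h5repack (assuming h5py 3.x, which exposes fs_strategy/fs_page_size; the 8 MiB page size, file name, and dataset are illustrative only, not recommendations):

```python
import h5py

# Sketch: create a file with the PAGE file space strategy and an example 8 MiB page size.
# Programmatic counterpart of repacking with FS_STRATEGY=PAGE and FS_PAGESIZE; the page
# size, file name, and dataset below are illustrative values, not recommendations.
with h5py.File("paged-example.h5", "w",
               fs_strategy="page",
               fs_page_size=8 * 1024 * 1024) as f:
    f.create_dataset(
        "example_variable",
        shape=(10_000, 1_000),
        chunks=(1_000, 1_000),
        compression="gzip",
    )
```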

Secondly, if I include this level of detail I think I should also clarify that pages are different from chunks. If I understand correctly, HDF5 will create "pages" of data using the page size, but the raw data itself could also be chunked, so presumably the chunks will always be smaller than the page sizes. Is this a correct understanding?

Also, are we concerned about increases in file size purely from a storage cost perspective? My understanding was that for performance, total file size doesn't matter as long as we can just grab reasonably sized chunks from the file.

> The official HDF5 library needs to be configured to make use of page-aggregated files; I think @ajelenak said this will change in March when the HDF Group releases the next major version.

I see in the HDF5 library there is H5Pset_page_buffer_size, and in h5py you can set page_buf_size. Are you saying these arguments already work for using page-aggregated files, or that this page buffer size setting is not enough on its own?
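
To make sure we're talking about the same thing, this is the sort of usage I mean: a sketch assuming h5py 3.x and a file already written with the PAGE strategy and an 8 MiB page size; the 32 MiB buffer, file name, and dataset name are illustrative only:

```python
import h5py

# Sketch: open a page-aggregated file for reading with a page buffer enabled.
# page_buf_size must be at least the file's page size (assumed 8 MiB here); the 32 MiB
# value, file name, and dataset name are illustrative only.
with h5py.File("paged-example.h5", "r",
               page_buf_size=32 * 1024 * 1024) as f:
    subset = f["example_variable"][:100, :100]
    print(subset.shape)
```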

> HDF5 doesn't have a geo spec for spatial chunking. At the lowest level, if a user needs to subset data (e.g. lat/lon subsetting), the HDF5 library has to load all the chunks of a dataset to create an index and use it to subset. To take CO-HDF5 to the next level, each chunk in the file should carry polygon/bbox info, indexed in a way that the drivers can understand. This is related to over-reads: e.g. our data chunks are ~1MB each and we use 8MB pages, so in a subsetting operation where we only need 2 chunks from contiguous pages... we will be requesting 16MB instead of 2MB. @ajelenak can confirm if this is true, and @bilts also mentioned it in his ESIP talk.

This is interesting, but I'm still trying to understand it, so I'm not sure how to include it in a way that is useful to readers. For the purposes of this first draft I will omit it, if that's ok with you.

> On creating vs repacking:

I'll add these in as notes.

> If I think of more stuff I'll add it later (I will be out next week). We also need to finish our tech report on IS2 and CO-HDF5; I think it should be ready for AGU.

Is the tech report out? If so I will link to it for sure.

@ashiklom commented:

Chiming in, since I'm actively working on applying this guidance to some next-generation data products from GMAO. Others, please correct me if I'm wrong!

My mental model of paged aggregation is that, when enabled, a page is basically the smallest unit of data that HDF5 can read or write; i.e., you can't read or write part of a page. All the consequences of inappropriately set page sizes flow from that.

> ...believe you can set the metadata block size separately from the raw data block size (and more specifically, it's a block size for "small" data, so I'm assuming that just means cases where multiple small raw datasets can fit into one block). Do you know whether h5repack uses FS_PAGESIZE for both metadata and small raw data?

I've never seen any HDF5 person mention different page sizes for metadata vs. (chunked) data. I think the two things you're linking to here refer to a different (not page-based) storage management strategy. But it would be awesome if HDF5 could have more flexible page sizes!

> If I understand correctly, HDF5 will create "pages" of data using the page size, but the raw data itself could also be chunked, so presumably the chunks will always be smaller than the page sizes. Is this a correct understanding?

My understanding is: Chunk sizes should be smaller than page sizes, but I don't think it's required; you can split chunks across multiple pages. Otherwise, HDF5's tiny default page size (4 KB?) would fail for most datasets.

> Also, are we concerned about increases in file size purely from a storage cost perspective? My understanding was that for performance, total file size doesn't matter as long as we can just grab reasonably sized chunks from the file.

My guess is that there might be a minor performance penalty for retrieving unused data (because you have to download/read more data than you actually need), but it'll be negligible in most cases. So yes, the primary concern with large page sizes is that they inflate overall file size (and therefore storage cost). But since lots of NASA data are big, that's a very important consideration! A 10% increase in NASA's ~140 PB catalog is ~14 PB, which is multiple big missions' worth of data!

@abarciauskas-bgse (Contributor, Author) commented:

@ashiklom thank you so much for chiming in! These thoughts are super helpful, and I'm interested to know how the GMAO product development goes.

> My mental model of paged aggregation is that, when enabled, a page is basically the smallest unit of data that HDF5 can read or write; i.e., you can't read or write part of a page. All the consequences of inappropriately set page sizes flow from that.

That is a helpful simplification, thank you.

> I've never seen any HDF5 person mention different page sizes for metadata vs. (chunked) data. I think the two things you're linking to here refer to a different (not page-based) storage management strategy. But it would be awesome if HDF5 could have more flexible page sizes!

I think you're right that these API methods are for a different file space management strategy.

> My understanding is: Chunk sizes should be smaller than page sizes, but I don't think it's required; you can split chunks across multiple pages. Otherwise, HDF5's tiny default page size (4 KB?) would fail for most datasets.

👍🏽

> My guess is that there might be a minor performance penalty for retrieving unused data (because you have to download/read more data than you actually need), but it'll be negligible in most cases. So yes, the primary concern with large page sizes is that they inflate overall file size (and therefore storage cost). But since lots of NASA data are big, that's a very important consideration! A 10% increase in NASA's ~140 PB catalog is ~14 PB, which is multiple big missions' worth of data!

👍🏽

I have incorporated most of these comments into a new box, "HDF5 File Space Management Strategies".

@betolink commented:

I concur with all the things @ashiklom said.

If chunk sizes are larger than page sizes they will be tracked separately, so page aggregation won't be applied to them, which is bad. I want to dive into a geo-spec for HDF5: how we can rechunk different collections and add geo-metadata to improve access even more, something I talked about with Aleksandar and Patrick.

The technical report on ATL03 is almost there; I think I'll use the holidays to finish it. I'm not sure about funding yet, but after talking to Brianna (NASA) I think the Cloud Native summit in April would be a great place to present it.

@abarciauskas-bgse (Contributor, Author) commented:

@betolink thank you for sharing the tech report, it looks great.

Just one question:

> If chunk sizes are larger than page sizes they will be tracked separately, so page aggregation won't be applied to them.

Do you mean that there will be metadata for both pages AND chunks? And why is this bad, besides an increase in metadata? Is it because of chunk over-reading when reading multiple pages? Sorry, this is the first time I'm hearing about this and I'm curious about how it works. In the technical report it says "Chunk sizes cannot be larger than the page size", which seems contradictory to what we are discussing here (that chunk sizes can be larger than page sizes, but it slows things down).

@abarciauskas-bgse (Contributor, Author) commented:

@wildintellect ok I have incorporated comments to date. I am happy to merge and publish and we can update with new feedback as it arrives.

@wildintellect self-requested a review December 19, 2024 16:53
@wildintellect (Contributor) left a comment:

I only have one minor question: is it bad to drop all references to alternatives to cloud optimization (i.e. services) like Hyrax, OPeNDAP, etc.? Should we be saying why we think cloud optimized is better, but that these do exist as alternatives?

@betolink commented:

> I only have one minor question: is it bad to drop all references to alternatives to cloud optimization (i.e. services) like Hyrax, OPeNDAP, etc.? Should we be saying why we think cloud optimized is better, but that these do exist as alternatives?

Good question. I see cloud-native formats as a better long-term solution than transformation services; although these services are needed in some cases, they too will benefit from the data being in cloud-optimized formats.

"Chunk sizes cannot be larger than the page size", which seems contradictory to what we are discussing here

They can be larger, but then the driver won't access those chunks using the single-page-size approach; they will be accessed as if they were in a regular HDF5 file, which is not bad if the chunk sizes are really large. We only ran into one case for ICESat-2: the page size was 8MB and a dataset had 10 MB chunks, so the smaller chunks were grouped into pages and the 10 MB chunks were not. Since they are big enough, the performance was not degraded. I think it was one of the 2 atmospheric datasets.

@abarciauskas-bgse (Contributor, Author) commented:

We actually do cover services generally on the home page:

> While it is possible to provide subsetting as a service, this requires ongoing maintenance of additional servers and extra network latency when accessing data (data has to go to the server where the subsetting service is running and then to the user). With cloud-optimized formats and the appropriate libraries, subsets of data can be accessed directly from an end user's machine without introducing an additional server.

But I think this sentence, which I just added to the introduction of the CO HDF5/NetCDF-4 page, strengthens the intro by providing a reason for cloud-optimizing:

> Cloud-optimized formats provide efficient subsetting without maintaining a service, such as OpenDAP, Hyrax or SlideRule.

@abarciauskas-bgse (Contributor, Author) commented:

Thanks @betolink for helping out here; I hope you don't mind me pursuing this question about chunk sizes and page sizes. My reasoning may be wrong or I may be missing a scenario, but I'm not sure I understand how having chunks larger than page sizes would degrade read performance. Here are some scenarios:

  1. Chunk size divides evenly into the page size. This seems good and fine because 1 or more chunks can be grouped into 1 page (as long as pages aren't too big relative to chunks, which could unnecessarily slow performance by reading 1 page that contains many chunks just to get a few of them).
  2. Chunk size fits within a page but does not divide evenly into the page size, say 8MB pages and 5MB chunks. Is each 5MB chunk stored in its own page, incurring lots of wasted space (very bad for file size)? Or are some chunks split across multiple pages, so that you have to load, say, 16MB of pages to read 10MB of chunk data (kinda bad)? If chunks are split across multiple pages, this seems inconsistent with the behavior in (3), where chunks larger than the page size are not split into pages.
  3. Chunk size is larger than page size. As you indicated above, these chunks are just not grouped into pages and performance does not degrade, since presumably the chunks are loaded in their entirety, similar to pages.

@abarciauskas-bgse (Contributor, Author) commented:

@wildintellect I'm going to go ahead and merge and I can incorporate any add'l feedback from @betolink and @ajelenak as it comes. Thank you for reviewing it!

@abarciauskas-bgse merged commit ed64546 into staging Dec 20, 2024
3 checks passed
@abarciauskas-bgse deleted the add-co-hdf5-guidance branch December 20, 2024 21:23